more catching of build errors, dump of logs, etc.; add registry pod d… #8591
Conversation
[testonlyextended][extended:core(builds)]
lgtm, needs @smarterclayton approval. and let's make sure we have a plan to strip all this out (aside from the waitforabuild calls which seems like a good general improvement) once we get to the bottom of this.
Why do we need to merge this vs just running testonlyextended?
@smarterclayton I'm in agreement with not merging at this time, especially given the latest test results. The df/du of the registry pod, along with the df of the ec2 instance, shed some light on the situation, though the root cause is still TBD. Specifically:

ec2 df:
registry df:

Those two entries were captured immediately after a push to the registry failed with the same message reported earlier:

time="2016-04-21T21:51:06.639377806Z" level=error msg="response completed with error" err.code=UNKNOWN err.detail="filesystem: write /registry/docker/registry/v2/repositories/extended-test-build-image-source-eoxy7-hoeui/imagesourceapp/_uploads/30d40477-3875-4f92-b607-1e6b418f4d79/data: disk quota exceeded" err.message="unknown error" go.version=go1.6 http.request.host="172.30.131.224:5000" http.request.id=941f71c2-a05f-4830-a6d7-cb885abcb54f http.request.method=PATCH http.request.remoteaddr="172.18.12.132:34979" http.request.uri="/v2/extended-test-build-image-source-eoxy7-hoeui/imagesourceapp/blobs/uploads/30d40477-3875-4f92-b607-1e6b418f4d79?_state=KKbjEmD56jWJfd8GXI1LpPcAM2QkR8GiEg2V9UrfdlN7Ik5hbWUiOiJleHRlbmRlZC10ZXN0LWJ1aWxkLWltYWdlLXNvdXJjZS1lb3h5Ny1ob2V1aS9pbWFnZXNvdXJjZWFwcCIsIlVVSUQiOiIzMGQ0MDQ3Ny0zODc1LTRmOTItYjYwNy0xZTZiNDE4ZjRkNzkiLCJPZmZzZXQiOjAsIlN0YXJ0ZWRBdCI6IjIwMTYtMDQtMjFUMjE6NTA6NTguMTkxMzI5MjVaIn0%3D" http.request.useragent="docker/1.9.1 go/go1.4.2 kernel/3.10.0-229.7.2.el7.x86_64 os/linux arch/amd64" http.response.contenttype="application/json; charset=utf-8" http.response.duration=8.298964494s http.response.status=500 http.response.written=293 instance.id=63aa5205-b8fb-4cf4-ad4e-12d25be0054b vars.name="extended-test-build-image-source-eoxy7-hoeui/imagesourceapp" vars.uuid=30d40477-3875-4f92-b607-1e6b418f4d79

The du /registry output on the registry pod matched up with the data from df: a mixture of image entries, secret entries, etc., a decent number of files, none overly big. So either the entry that causes the problem is exceptionally large and is then cleaned up as part of error processing, hence gone before we can run df/du, or the message details about the disk quota being violated are misleading somehow (or we are misinterpreting what it means). As @bparees and I discussed earlier, I think we've done due diligence here. It is time to pull in the team that owns the registry and get their help in deciphering what is going on. I'm assuming we can add whatever debug they would need into this PR and rerun the extended test. Thoughts?
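For reference, a minimal sketch of the kind of in-pod disk checks being discussed, done by shelling out to oc exec from a Go test helper; the function name, namespace, and pod-name handling here are assumptions, not the PR's actual ExamineDiskUsage/ExaminePodDiskUsage code.

```go
package debug

import (
	"fmt"
	"os/exec"
)

// DumpRegistryDiskUsage is a hypothetical helper: it runs df and du inside
// the named registry pod via `oc exec` so the output can be correlated with
// "disk quota exceeded" errors in the registry logs.
func DumpRegistryDiskUsage(namespace, pod string) {
	for _, args := range [][]string{
		{"exec", "-n", namespace, pod, "--", "df", "-h"},
		{"exec", "-n", namespace, pod, "--", "du", "-sh", "/registry"},
	} {
		out, err := exec.Command("oc", args...).CombinedOutput()
		if err != nil {
			fmt.Printf("oc %v failed: %v\n", args, err)
		}
		fmt.Printf("oc %v output:\n%s\n", args, out)
	}
}
```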
@pweil- during the build/image extended tests (both with the periodic jobs from https://ci.openshift.redhat.com/jenkins/job/origin_extended_build_tests, as well as individual PRs like this one) we have been getting an intermittent 500 on pushing images to the internal registry. When examining the registry logs, we see messages at times corresponding to the push failures in the builder logs, like the ones noted in my prior comment. However, while the message implies a disk space / quota issue, our examinations of the file system show plenty of space. Are the logs for the last test run sufficient for your team to provide some insight? Or do we need to re-run with some additional debug? thanks
Looking...
Force-pushed from 26c0262 to cc744a9.
hey @soltysh - have you had a chance to repro / look at this at all? Is there some debug you would like me to add to the PR and retest?
Force-pushed from cc744a9 to f2c3d59.
@gabemontero sorry, didn't get to it yet. I wanted to finish job-related stuff first. Hopefully I'll have some time tomorrow to look at it.
Cool, no worries @soltysh
Force-pushed from f2c3d59 to 2e76fdd.
So a little internet research turned up a docker support thread that, while unresolved, showed some similarities with what we are seeing: https://meta.discourse.org/t/disk-quota-exceeded-when-chown-ing/36500/20 Based on this and a couple of other similar hits, I'm going to add a call to … If not, @csrwng had the suggestion that maybe selinux is playing a part. Perhaps I'll see about disabling it in the PR test runs and see if any change in behavior occurs.
Force-pushed from 6939a7c to 3057cde.
Things attempted / found today:
Still think I need to figure out how to get a report of quota-level stats and/or config in the registry container .... might next look at dumping the contents of …
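One possible way to get at the quota config from inside the registry container, sketched here with assumed namespace/pod parameters: dump /proc/mounts via oc exec, since the backing filesystem type and (for XFS) quota-related mount options such as grpquota/prjquota generally show up there.

```go
package debug

import (
	"fmt"
	"os/exec"
)

// dumpRegistryMounts is a hypothetical helper: it prints /proc/mounts from
// inside the registry pod to show which filesystem backs /registry and which
// quota mount options, if any, are in effect.
func dumpRegistryMounts(namespace, pod string) {
	out, err := exec.Command("oc", "exec", "-n", namespace, pod,
		"--", "cat", "/proc/mounts").CombinedOutput()
	if err != nil {
		fmt.Printf("oc exec failed: %v\n", err)
	}
	fmt.Printf("/proc/mounts in pod %s:\n%s\n", pod, out)
}
```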
Force-pushed from dfdde46 to bcdcf06.
Some more updates:
Definitely feel like we've moved past my linux/AWS expertise and possibly to the point of diminishing returns in trying to uncover the … Before officially punting and simply opening an issue/bugzilla to officially track this with the platform mgmt team, I'm going to try again to look at pruning images from the registry between tests and see if that keeps us under this mysterious …
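A rough sketch of the pruning idea, assuming the oadm binary is on the test host and the invoking user has the needed permissions; without an explicit confirmation flag this only reports prune candidates rather than deleting anything.

```go
package debug

import (
	"fmt"
	"os/exec"
)

// pruneRegistryImages is a hypothetical between-test cleanup step: a dry-run
// image prune whose output shows how much the registry volume could be
// reclaimed. Actual deletion requires passing the confirmation flag.
func pruneRegistryImages() {
	out, err := exec.Command("oadm", "prune", "images").CombinedOutput()
	fmt.Printf("oadm prune images:\n%s\n", out)
	if err != nil {
		fmt.Printf("prune failed: %v\n", err)
	}
}
```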
Note that we run conformance tests with an XFS quota on empty dir.
Thanks for the confirmation. Still wonder why the 'quota' cmd returns …
Force-pushed from bcdcf06 to 31e2f20.
Well,

Group quota on /mnt/openshift-xfs-vol-dir (/dev/mapper/docker--vg-openshift--xfs--vol--dir)
ec2-user 0 0 0 00 [--------]
Group quota on /mnt/openshift-xfs-vol-dir (/dev/mapper/docker--vg-openshift--xfs--vol--dir)
ec2-user 0 0 0 00 [--------]

Running the same command from the extended test when the error is reported returns nothing. @dgoodwin @smarterclayton do we perhaps have some sort of xfs_quota or docker emptydir problem?
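For context on the two views being compared above, a minimal sketch (group name and mount point are assumptions) that captures both the generic quota output and the XFS-specific group report; the two query different interfaces and so can legitimately disagree.

```go
package debug

import (
	"fmt"
	"os/exec"
)

// dumpQuotaViews is a hypothetical helper: it runs the generic `quota -g`
// command and the XFS-native `xfs_quota ... report -g` for the same group and
// mount point, so a missing or empty result from one can be compared against
// the other.
func dumpQuotaViews(group, mountPoint string) {
	cmds := [][]string{
		{"quota", "-g", group},
		{"xfs_quota", "-x", "-c", "report -g", mountPoint},
	}
	for _, c := range cmds {
		out, err := exec.Command(c[0], c[1:]...).CombinedOutput()
		fmt.Printf("%v:\n%s\n", c, out)
		if err != nil {
			fmt.Printf("%v returned error: %v\n", c, err)
		}
	}
}
```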
Force-pushed from 76db393 to d01f79d.
@bparees @dgoodwin - PTAL at the proposed changes, I'd like to get these merged and move on to the next set of issues with the build/images extended tests. On the xfs_quota related change, I backed off a bit on the increase in the quota size (went from 256 Mi to 896 Mi), so that we don't consume the full 1 Gi the ec2 instances have allocated. So in theory we should still see the quota stop things before the FS is in fact fully consumed. Also, @bparees @csrwng - the 2 tests with the internal git server in gitauth.go have now failed a couple of times in a row with the same error since we moved past the disk quota issue, and these are the only build extended tests that failed. That will be the next area of focus for my debug, but I seem to recall from scrum some debug activity with the internal git server. Are my recollections correct / are there known issues that would affect these extended tests? If not, some background into why the URLs are constructed the way they are might be helpful. thanks
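On the quota sizing change, this is roughly how a group block limit in that range is set with xfs_quota; whether the extended test setup scripts use exactly this form (or this group/mount point) is an assumption.

```go
package debug

import (
	"fmt"
	"os/exec"
)

// setRegistryQuota is a hypothetical sketch: cap the group at 896Mi soft and
// hard block limits, leaving headroom below the ~1Gi the ec2 instance
// allocates so the quota trips before the filesystem is actually full.
func setRegistryQuota(group, mountPoint string) error {
	cmd := exec.Command("xfs_quota", "-x", "-c",
		fmt.Sprintf("limit -g bsoft=896m bhard=896m %s", group), mountPoint)
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("xfs_quota limit failed: %v: %s", err, out)
	}
	return nil
}
```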
I can reproduce the gitauth.go extended test errors locally.
@@ -48,25 +49,56 @@ func DumpBuildLogs(bc string, oc *CLI) {
	fmt.Fprintf(g.GinkgoWriter, "\n\n got error on bld logs %v\n\n", err)
}

ExamineDiskUsage()
// if we suspect that we are filling up the registry file system, call ExamineDiskUsage / ExaminePodDiskUsage
// also see if manipulations of the quota around /mnt/openshift-xfs-vol-dir exist in the extended test setup scripts
this would be one case where i'd leave the call in place, commented out.
added the calls, commented out
@gabemontero one nit and then i'd be ok w/ merging this if that's what you want.
Force-pushed from d01f79d to 5be0a98.
Evaluated for origin testonlyextended up to 5be0a98
@bparees yeah, let's go ahead and merge ... aside from wanting to see the results for the build extended tests on the official job, I'm curious about the delta in behavior on the images extended tests (something seems off with my PR for the images side). We'll work these gitauth.go failures in a separate PR. They are showing up locally anyway; they are just broken vs. getting impacted by the env.
The problem with the gitserver tests is that they rely on the router and .xip.io to test ssl auth: It seems that .xip.io resolution has been flakier than usual lately
Yep that was where I was at ... where I was headed was simply pulling the … Or do we really care about the name resolution?
We need it if we want to test secured communication to the server (via SSL)... only possible through the router currently
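Since the flake seems to start at name resolution, a small hypothetical pre-flight check could confirm that the route's .xip.io hostname (which the wildcard DNS service maps back to the embedded IP) actually resolves before the SSL portion of the test runs; the host parameter is an assumption.

```go
package debug

import (
	"fmt"
	"net"
)

// checkXipResolution is a hypothetical pre-flight check: names like
// gitserver.10.1.2.3.xip.io should resolve to 10.1.2.3 via the public
// wildcard DNS service; if this lookup fails, the gitauth test would fail
// before SSL auth is even exercised.
func checkXipResolution(host string) error {
	addrs, err := net.LookupHost(host)
	if err != nil {
		return fmt.Errorf("xip.io lookup for %s failed: %v", host, err)
	}
	fmt.Printf("%s resolved to %v\n", host, addrs)
	return nil
}
```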
The router logs have this (and something similar for the gitserver-tokenauth service):
Don't know yet if this is benign or not.
lgtm, thanks for the deep dive on this @gabemontero
continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_requests_origin/5841/) (Image: devenv-rhel7_4131)
Evaluated for origin merge up to 5be0a98
[Test]ing while waiting on the merge queue
Evaluated for origin test up to 5be0a98
NP @bparees. On the route, based on the comments on this method, we might possibly expect a connection error, not a resolve-host error. Going to hit pause for now, wait for this PR to merge, and then see about pulling in the networking team / brainstorming during scrum on Monday.
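A small sketch of how the two failure modes mentioned here can be told apart in Go, so a resolve-host problem (the .xip.io flake) isn't mistaken for a plain connection error; the address parameter and helper name are assumptions.

```go
package debug

import (
	"fmt"
	"net"
)

// classifyDialError is a hypothetical check: a DNS failure surfaces from
// net.Dial as an *net.OpError wrapping *net.DNSError, while a refused or
// timed-out connection does not, which distinguishes "no such host" from a
// genuine connection-level error.
func classifyDialError(addr string) {
	conn, err := net.Dial("tcp", addr)
	if err == nil {
		conn.Close()
		fmt.Println("connected")
		return
	}
	if opErr, ok := err.(*net.OpError); ok {
		if _, isDNS := opErr.Err.(*net.DNSError); isDNS {
			fmt.Printf("name resolution failure: %v\n", err)
			return
		}
	}
	fmt.Printf("connection-level failure: %v\n", err)
}
```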
@gabemontero If what you're seeing is this:
@csrwng yep that is where I was landing wrt that certificate based log. The use of the openshift proxy server as an alternative sounds cool. Consider the baton officially passed to you then until you get a chance to investigate :-)
continuous-integration/openshift-jenkins/testonlyextended FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin_extended/162/) (Extended Tests: core(builds))
continuous-integration/openshift-jenkins/test FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/3654/)
@gabemontero awesome work!!! Thank you! And sorry I wasn't able to get into that...
Thanks @soltysh - and no worries - it was educational ;-)
@bparees @csrwng FYI - next round of debug on ext test registry disk quota error ... caught some more tests' build failures, and added the running of df/du inside the registry pod.